Optimizing Dynamic Time Warping’s Window Width for Time Series Data Mining Applications

Author: Bagnall Anthony
Dau Hoang Anh
Forestier Germain
Keogh Eamonn
Mueen Abdullah
Petitjean Francois
Silva Diego Furtado
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 09/04/2018

Dynamic Time Warping (DTW) is a highly competitive distance measure for most time series data mining problems. Obtaining the best performance from DTW requires setting its only parameter, the maximum amount of warping (w). In the supervised case with ample data, w is typically set by cross-validation in the training stage. However, this method is likely to yield suboptimal results for small training sets. For the unsupervised case, learning via cross-validation is not possible because we do not have access to labeled data. Many practitioners have thus resorted to assuming that “the larger the better”, and they use the largest value of w permitted by the computational resources. However, as we will show, in most circumstances, this is a naïve approach that produces inferior clusterings. Moreover, the best warping window width is generally non-transferable between the two tasks, i.e., for a single dataset, practitioners cannot simply apply the best w learned for classification to clustering or vice versa. In addition, we will demonstrate that the appropriate amount of warping depends not only on the data structure, but also on the dataset size. Thus, even if a practitioner knows the best setting for a given dataset, they will likely be at a loss if they apply that setting to a larger version of that data. All these issues seem largely unknown or at least unappreciated in the community. In this work, we demonstrate the importance of setting DTW’s warping window width correctly, and we also propose novel methods to learn this parameter in both supervised and unsupervised settings. The algorithms we propose to learn w can produce significant improvements in classification accuracy and clustering quality. We demonstrate the correctness of our novel observations and the utility of our ideas by testing them with more than one hundred publicly available datasets.
Our forceful results allow us to make a perhaps unexpected claim: an underappreciated “low hanging fruit” in optimizing DTW’s performance can produce improvements that make it an even stronger baseline, closing most or all of the improvement gap of the more sophisticated methods proposed in recent years.
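
To make the role of the warping window w concrete, here is a minimal sketch of window-constrained DTW and of the cross-validation baseline the abstract mentions. It assumes the window is a Sakoe-Chiba band (a cap on |i - j| along the warping path), a squared pointwise cost, and leave-one-out 1-nearest-neighbor error as the selection criterion. The names `dtw_distance` and `loocv_error` are illustrative and are not taken from the paper; this is not the learning method the authors propose.

```python
import math


def dtw_distance(a, b, window=None):
    """DTW distance between two numeric sequences.

    `window` caps |i - j| along the warping path (assumed Sakoe-Chiba band);
    None means unconstrained warping, 0 reduces to Euclidean distance for
    equal-length inputs.
    """
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide, or no valid path exists.
    w = max(n, m) if window is None else max(window, abs(n - m))
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are filled; the rest stay infinite.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return math.sqrt(cost[n][m])


def loocv_error(series, labels, window):
    """Leave-one-out 1-NN error rate under DTW with the given window."""
    errors = 0
    for i, query in enumerate(series):
        best_dist, best_label = math.inf, None
        for j, candidate in enumerate(series):
            if i == j:
                continue  # hold the query itself out
            d = dtw_distance(query, candidate, window)
            if d < best_dist:
                best_dist, best_label = d, labels[j]
        errors += best_label != labels[i]
    return errors / len(series)


if __name__ == "__main__":
    # Toy data (made up for illustration): two shifted bumps and two flat lines.
    series = [[0, 0, 1, 2, 1, 0, 0], [0, 1, 2, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]]
    labels = ["bump", "bump", "flat", "flat"]
    for w in (0, 1, 2):
        print(w, dtw_distance(series[0], series[1], w),
              loocv_error(series, labels, w))
```

In the supervised setting one would evaluate `loocv_error` over a range of candidate windows and keep the smallest error; the abstract's point is that this simple procedure can be unreliable on small training sets and is unavailable without labels.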

Crossref

univOAK

University of East Anglia digital repository


The Swiss army knife of time series data mining: ten useful things you can do with the matrix profile and ten lines of code

Author: Almaslukh Abdulaziz
Dau Hoang Anh
Funning Gareth
Gharghabi Shaghayegh
Kamgar Kaveh
Keogh Eamonn
Mueen Abdullah
Shakibay Senobari Nader
Silva Diego Furtado
Yeh Chin-Chia Michael
Zhu Yan
Zimmerman Zachary
Publication venue: eScholarship, University of California
Publication date: 01/07/2020

eScholarship - University of California

Class based Influence Functions for Error Detection

Author: Bui Nghi D. Q.
Dau Anh T. V.
Huu-Tien Dang
Nguyen Hieu Ngoc
Nguyen-Duc Thang
Thanh-Tung Hoang
Tran Quan Hung
Publication date: 02/05/2023

Influence functions (IFs) are a powerful tool for detecting anomalous examples in large scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points belong to two different classes. Our solution leverages class information to improve the stability of IFs. Extensive experiments show that our modification significantly improves the performance and stability of IFs while incurring no additional computational cost. Comment: Thang Nguyen-Duc, Hoang Thanh-Tung, and Quan Hung Tran are co-first authors of this paper. 12 pages, 12 figures. Accepted to ACL 202

arXiv.org e-Print Archive

The UCR Time Series Archive

Author: Bagnall Anthony
Dau Hoang Anh
Gharghabi Shaghayegh
Kamgar Kaveh
Keogh Eamonn
Ratanamahatana Chotirat Ann
Yeh Chin-Chia Michael
Zhu Yan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 08/09/2019

The UCR Time Series Archive, introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets, but since that time it has gone through periodic expansions. The last expansion took place in the summer of 2015, when the archive grew from 45 to 85 data sets. This paper introduces and focuses on the new expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest neighbor classification), a fraction might be mis-attributing the reasons for their improvement. Moreover, the improvements claimed by these papers might have been achievable with a much simpler modification, requiring just a few lines of code.

arXiv.org e-Print Archive

University of East Anglia digital repository

Merosesquiterpenes from marine sponge Smenospongia cerebriformis

Author: Dau Nguyen Van
Huyen Le Thi
Kiem Phan Van
Minh Chau Van
Nhiem Nguyen Xuan
Tai Bui Huu
Thuy Hang Dan Thi
Tuan Anh Hoang Le
Yen Pham Hai
Publication venue: 'Publishing House for Science and Technology, Vietnam Academy of Science and Technology'
Publication date: 28/04/2017

Using various chromatography methods, three merosesquiterpenes of the sesquiterpene quinone type, neodactyloquinone (1), dactyloquinone D (2), and dactyloquinone C (3), together with two indole derivatives, indole-3-aldehyde (4) and indole-3-carboxylic methyl ester (5), were isolated from the methanol extract of the Vietnamese marine sponge Smenospongia cerebriformis. Their structures were determined by 1D and 2D NMR spectra, HR-ESI-MS, and comparison with those reported in the literature. Keywords: Smenospongia cerebriformis, merosesquiterpene, sesquiterpene quinone, indole derivative.

Vietnam Academy of Science and Technology: Journals Online


Towards More Accurate Time Series Data Mining by Constraining Model's Flexibility

Author: Dau Hoang Anh
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

This dissertation is motivated by the goal of enabling various tasks in large scale data mining of time series to produce results that are more accurate, reproducible, and tailored to the user’s specific needs when that is favored. To that end, we have explored and contributed to the literature in three parts; each touches an active area of research, and all are unified under a common theme: reducing errors in time series data mining by learning constraints on the model’s flexibility. The first body of work concerns Dynamic Time Warping (DTW), a highly competitive distance measure for most time series data mining problems. Obtaining the best performance from DTW requires setting its only parameter, the maximum amount of warping. This parameter gives DTW the flexibility to deal with data that can be locally out of phase; however, the DTW algorithm sometimes exploits this flexibility to give pathological and unwanted results. We demonstrate the importance of setting DTW’s warping window width correctly, to constrain this flexibility, and we propose novel methods to learn this parameter in both supervised and unsupervised settings. The second body of work concerns time series motif discovery, perhaps the most used primitive for time series data mining. We point out that the current definitions of motif discovery are limited and can create a mismatch between the user’s intent/expectations and the motif discovery search outcomes. We explain the reasons behind these issues and introduce a novel and general framework to address them. The last body of work concerns making more time series data sets and baseline results publicly available for gauging progress and comparing rival approaches in the spirit of reproducible research. We work on expanding the UCR Time Series Archive, an important resource in the time series data mining community, from 85 data sets since the last Fall 2015 release to 128 data sets in Fall 2018.
Creating benchmark results for this archive required 61,041,100,000,000 DTW comparisons, far more than the number of DTW comparisons that have appeared in all research papers combined. Beyond expanding this valuable resource, we offer pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive.

eScholarship - University of California
